Abstract: Activity of rheumatoid arthritis (RA) can be evaluated using several scoring scales 
based on clinical features. The most widely used one is the Disease Activity Score involving 
28 joint counts (DAS28) for which cut-offs were proposed to help physicians classify patients. 
However, inaccurate scoring can lead to inappropriate medical decisions. In this article some 
methodological issues in the design of such a score and its cut-offs are highlighted in order to 
further propose a strategy to overcome them. As long as the issues reviewed in this article are 
not addressed, results of studies based on standard disease activity scores such as DAS28 should 
be considered with caution. 

Keywords: DAS28, disease activity score, penalized logistic regression, clinical prediction, 
modeling 

Introduction 

Rheumatoid arthritis (RA) is a systemic disease which occurs in about 1% of the 
world population and triggers joint inflammations that may worsen patients' quality of 
life. Nowadays, efficient disease-modifying antirheumatic drugs (DMARDs), such as 
methotrexate, as well as targeted immunomodulating agents, are available to relieve 
patients. 1-4 In order to define treatment strategy and to evaluate response to therapy, 
disease activity may be measured via several scoring schemes, including the Disease 
Activity Score involving 28 joint counts (DAS28), the Simplified Disease Activity 
Index (SDAI), and the Clinical Disease Activity Index (CDAI), 5,6 among others. All 
these measures consist of a weighted sum of bioclinical features, or a transform (eg, 
logarithm) of these, such as the number of tender joints in the hands, the number of 
swollen joints in the hands, the erythrocyte sedimentation rate (ESR), the C-reactive protein 
(CRP) concentration, the patient global assessment, and the physician global assessment. 
Several authors proposed cut-offs for these scores to help physicians classify patients 
in a particular disease activity state, ranging from remission to high activity. 7,8 
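
As an illustration of such a weighted sum of transformed features, the sketch below computes the ESR-based DAS28 in its commonly cited form. The coefficients (0.56, 0.28, 0.70, 0.014) and the function name `das28_esr` are not given in this article; they are assumptions that should be checked against the original publication before any use.

```python
import math

def das28_esr(tender_joints, swollen_joints, esr, global_health):
    """ESR-based DAS28 (commonly cited form; coefficients assumed, not from this article).

    tender_joints, swollen_joints: counts over the 28 assessed joints (0 to 28)
    esr: erythrocyte sedimentation rate in mm/h (must be > 0)
    global_health: patient global assessment on a 0-100 mm visual analogue scale
    """
    return (0.56 * math.sqrt(tender_joints)
            + 0.28 * math.sqrt(swollen_joints)
            + 0.70 * math.log(esr)
            + 0.014 * global_health)

# Hypothetical patient, for illustration only.
score = das28_esr(tender_joints=4, swollen_joints=2, esr=20, global_health=50)
print(round(score, 2))  # 4.31
```

The square root and logarithm are the transforms mentioned above; they damp the influence of very high joint counts or ESR values on the final score.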

Among all disease activity scores, DAS28 is the most widely used. In clinical 
practice, it may be used as a monitoring tool to define treatment strategy and to 
further adapt it during patients' follow-up. 9,10 For example, Van der Cruyssen et al 11 
investigated the potential of DAS28 to help decide on a dose increase of infliximab to 
improve response to treatment in RA patients. On the other hand, DAS28 can be used 
in clinical trials to evaluate disease improvement from baseline using the European 
League Against Rheumatism (EULAR) response criteria. 12-14 

Detailed descriptions of RA scores can be found in two very helpful reviews from Anderson 
et al published in 2011 and 2012. 5,6 Some of these scores have been compared several times. 15,16 
An interesting introduction to the use of DAS28 can also be 
found online at http://www.das-score.nl . 

However, one should be very careful about the design 
of such RA activity scores and cut-offs. Indeed, inaccurate 
evaluation of RA activity using these scores may lead to 
inappropriate treatment administration or unreliable 
conclusions in clinical trials. For example, a DAS28 score lower than 
3.6 is often considered as evidence of low disease activity 
and thus a target to reach for the physicians. 17 However, if 
the initial design of DAS28 or the corresponding cut-off does 
not allow accurate evaluation of RA activity, a patient scoring 
below this cut-off could actually be in a moderate 
or high activity phase and would thus require initiation of 
treatment. In clinical trials, change in DAS28 from baseline 
could then be an inappropriate measure of drug efficacy. 

In the current rheumatology literature, no article yet seems 
to have pointed out incorrect design of such scores as a 
potential source of misevaluation of RA activity. For 
this reason, methodological problems related to the 
development of RA activity scores and cut-offs are emphasized 
in this article. This study is devoted especially to DAS28, 
since it is the most widely used scoring system in clinical 
practice and research. The seminal papers of Prevoo et al 
in 1995 18 and Aletaha et al in 2005, 7 where DAS28 and its 
cut-offs for disease activity were constructed, are therefore 
discussed. A strategy to address these issues is then 
proposed in order to further improve RA activity scoring and 
thus patient care. 

Review of methodological issues 

Definition and evaluation of rheumatoid 
arthritis activity 

Before developing a score, clear guidelines designed by expert 
physicians are needed to evaluate disease activity in order to 
limit inter-physician variability. 19 Such gold standards can 
be related, for example, to the level of physical impairment, 
the type of treatment needed, etc. However, it seems that no 
gold standard has yet been established for describing disease 
activity in RA. For example, there is still no consensus on the 
definition of remission. Remission can currently 
be confirmed when the DAS28 score is lower than 2.4, or 
confirmed independently using the 2010 American College 
of Rheumatology (ACR)/EULAR criteria. 7,17,20 Although both 
rules share common items, discrepancies exist and the need 
for guidelines has been pointed out. 21 Shaver et al already 
reported some inconsistencies using published cut-offs for 
remission of DAS28 and CDAI and therefore recommend 
cautious use of these with patients. 22 



The lack of guidelines was also illustrated in the study of 
Aletaha et al in 2005. 7 In this article, 35 experts had to judge 
the disease activity state of 32 RA patients. No reference 
explaining how patients were rated by the experts was given, 
although objective criteria had been clearly established when 
setting up DAS28 earlier in Prevoo et al's paper. 18 Only two 
of the 32 patients were unanimously classified in the same disease 
activity category by every expert: those of lowest and highest 
disease activity. Over the whole sample, the mean percentage 
of judges classifying a given patient into a group other than 
the majority group reached 28.42%. Even if the proposed statistical 
analysis tried to smooth the inter-expert variability by 
averaging the expert-specific cut-offs, it seems somewhat 
illusory to hope that precise cut-offs will help to classify 
patients when experts in the field experience some difficulty 
reconciling their judgments. 

Besides this, a scoring scale has to replace a gold standard 
that is difficult or expensive to measure directly. It has to rely 
on features other than the ones used to define the guidelines 
and to offer a comparable efficiency. 19 For example, if the 
number of tender joints was used by the 35 experts as the 
reference test to assess disease activity, then including it in 
a score such as DAS28 becomes redundant. 

Sampling from database 

Inclusion criteria identifying target patients should be clearly 
defined in order to build a database. For example, in the 
study of Aletaha et al, 7 it was not specified whether patients met 
the inclusion criteria required when developing DAS28 
in the original study of Prevoo et al. 18 Although the ACR 
criteria were respected, it was not specified whether patients had 
not been treated previously with DMARDs and whether the 
disease duration did not exceed one year, as required in 
Prevoo et al's article. 18 This leads to a lack of comparability 
between studies. 

Moreover, DAS28 was initially designed with 227 RA 
patients sampled from a longitudinal, hospital-based database 
to distinguish between high and low disease activity phases 
using Canonical Discriminant Analysis (CDA). 18 In this study 
patients went through several disease activity periods, among 
which only two were randomly selected. As a result, a given 
patient could have contributed to both high and low disease 
activity groups. This leads to a violation of the assumption 
of independence in the statistical analysis, which in turn 
leads to erroneous coefficient estimates of the features in 
CDA and thus to an incorrect score. This constraint could 
have been accounted for using mixed models. 23 Furthermore, 
the longitudinal aspect of the disease activity phases could 
also be very informative in predicting a novel activity phase. 
One may then wonder why the complete database was not 
used and why only two phases per patient were kept in the 
analysis. 

Design of the score 

The scoring objective (diagnosis, prognosis, choice of 
treatment, etc) should help the practitioner define which 
variables to integrate in the model. For example, some 
authors raised the possibility of adding extra variables 
such as imaging to DAS28 in order to predict remission. 24 Others 
raised the issue that DAS28 does not take into account age 
and sex, 25 nor the number of swollen and tender joints in 
the feet. 26 

Furthermore, sample size restricts the number of 
predictors that can be used. Indeed, according to Whitehead, 27 
when regressing an ordinal outcome with q categories on 
a dataset of size n, the maximal number of predictors should 
be lower than m/10 to avoid overfitting, where 

$$m = n - \frac{1}{n^2} \sum_{i=1}^{q} n_i^3 \qquad (1)$$

($n_i$ being the frequency of category i). 
For example, using Aletaha's sample 7 (remission: $n_1$=6, 
low activity: $n_2$=13, moderate activity: $n_3$=9, high activity: 
$n_4$=4), m is approximately 28.9, so no more than two to three 
predictors should have been included to build the score. 
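
The arithmetic behind this rule of thumb can be checked with a minimal sketch. The function name `effective_sample_size` is ours; the category frequencies are those quoted above for Aletaha's sample.

```python
def effective_sample_size(frequencies):
    """Effective sample size m = n - (1/n^2) * sum(n_i^3) for an ordinal outcome."""
    n = sum(frequencies)
    return n - sum(f ** 3 for f in frequencies) / n ** 2

# Remission, low, moderate, and high activity counts quoted in the text.
frequencies = [6, 13, 9, 4]
m = effective_sample_size(frequencies)
print(round(m, 2))       # 28.87
print(round(m / 10, 2))  # 2.89, ie, two to three predictors at most
```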

Definition of disease activity 
categories and design of the cut-offs 

Predictive ability, ie, the propensity of the score to recover 
the disease activity state of patients, is the primary objective 
of defining classification rules. For example, three methods 
were proposed by Aletaha et al in 2005 7 to define the cut-offs 
of DAS28 and SDAI and to classify patients in "remission" 
versus "low to high activity", "remission to low activity" 
versus "moderate to high activity", and "remission to 
moderate activity" versus "high activity". In the first method 
upper quartiles were used as optimal cut-offs, although they 
do not maximize any classification performance. The 
second technique relied on maximizing the κ statistic, which is 
more a measure of agreement between two rankings than a 
measure of performance. Conversely, the third approach, 
relying on ROC (receiver operating characteristic) curves, did not 
suffer from these drawbacks. Besides, DAS28 was originally 
developed as a discriminative tool for a two-level disease 
activity state, which was precisely defined by the frequency 
of DMARD administrations. Later, disease activity was 
defined as an ordinal outcome with four levels: remission; low 
activity; moderate activity; and high activity. Defining four 
states and cut-offs from a score designed primarily as a 
two-level disease activity measurement 18 seems suboptimal. New 
scores and cut-offs should be built to have the same number 
of categories as the gold standard they are replacing. 

Validation 

Developing complex scores and cut-offs, especially when the 
features are numerous and the sample size is low, can lead to 
overfitting problems. This means that model performance 
may decrease when the model is applied to other datasets. In Aletaha's 
study in 2005, 7 no actual statistical validation step was 
proposed to address this issue. The cut-offs were validated by 
applying them to other datasets and by showing a significant 
increase of surrogates of disease activity (like the Health 
Assessment Questionnaire [HAQ] functional index, which 
also combines information on damage and comorbidity) 
across the categories defined by those cut-offs. Proper 
statistical validation procedures are needed to avoid developing 
scores with over-optimistic predictive ability. 

Possible guidelines to define a 
disease activity score and cut-offs 

In this section a strategy is proposed to define a tool 
for assessing disease activity in RA patients and 
to address the issues reviewed above. 

Patients 

Patient inclusion criteria should be clearly defined (eg, ACR 
criteria, DMARDS administration, disease duration, etc). 
This would clarify which patients are on target and ensure 
comparability between studies. 

Disease activity 

A gold standard to assess disease activity should be defined 
by expert physicians. They need to define the number of 
disease activity groups in which patients can be classified 
and how to do it. Guidelines were recently published about 
panel diagnosis. 28 

Surrogate variables 

If this gold standard is considered too complex or too 
expensive to manage, physicians should review possible surrogate 
variables to replace the ones used in the gold standard. In 
that case, constructing a disease activity score using these 
variables is relevant. 



Clinical Epidemiology 2014:6 



Dovepress 






Collignon 






Study design 

The study design should match the objective of the score. For 
example, cross-sectional studies may be considered for 
diagnosis purposes, whereas cohort studies are preferred 
when prognosis is the aim of the score. 19 

Statistical analysis 

A comprehensive approach to clinical prediction models can be 
found in Steyerberg's 2009 publication. 19 

Estimation of disease activity state probabilities 

Any classification procedure can be used to predict disease 
activity states using the patient's features $x_1, \ldots, x_p$, such as forward 
continuation ratio ordinal logistic regression. 29 The 
probabilities of disease activity states DA can be modeled as follows: 

$$P(DA = k \mid DA \geq k, x_1, \ldots, x_p) = \frac{1}{1 + \exp[-(\alpha_k + \beta_1 x_1 + \cdots + \beta_p x_p)]} \qquad (2)$$

where $\alpha_k$ represents the baseline disease activity state of 
category k, varying from 1 to q. The parameters $\beta_1, \ldots, \beta_p$ are 
the coefficients of the features and measure their contribution 
to disease activity. Parameters are estimated by 
maximizing the likelihood of the model. Forward continuation ratio 
ordinal logistic regression is often seen as a discrete Cox 
survival model. 
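
To make the link between these conditional probabilities and the probability of each disease activity state explicit, here is a minimal sketch. It assumes a logistic link, one intercept per category except the last (whose conditional probability is 1 by construction), and purely hypothetical parameter values; the function name `state_probabilities` is ours.

```python
import math

def state_probabilities(alphas, betas, x):
    """Unconditional P(DA = k), k = 1..q, from forward continuation ratio probabilities.

    alphas: q - 1 intercepts, one per category except the last (assumption)
    betas:  feature coefficients shared across categories
    x:      feature values of one patient
    """
    linear = sum(b * xi for b, xi in zip(betas, x))
    probabilities = []
    remaining = 1.0  # P(DA >= k), starts at 1 for the first category
    for alpha in alphas:
        conditional = 1.0 / (1.0 + math.exp(-(alpha + linear)))  # P(DA = k | DA >= k)
        probabilities.append(remaining * conditional)
        remaining *= 1.0 - conditional
    probabilities.append(remaining)  # last category takes what is left
    return probabilities

# Hypothetical parameters, four categories from remission to high activity.
probs = state_probabilities(alphas=[0.0, 0.0, 0.0], betas=[0.0], x=[0.0])
print(probs)  # [0.5, 0.25, 0.125, 0.125]
```

Multiplying each conditional probability by the probability of having reached that category is what gives the model its discrete survival interpretation.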

Variable selection 

The most relevant predictors should be chosen using a 
variable selection scheme to avoid over-estimating their effects 
on disease activity state. Measuring only a limited number 
of predictors will also make the model more robust, easier to 
use, and cheaper. 

In order to remove irrelevant variables from the predictive 
model, penalization based on the L1 criterion can be used. 
This widely used method has several advantages 
over stepwise selection methods. 29 Only the relevant features 
receive a non-zero coefficient $\beta_j$ in the logistic regression 
formula and are further integrated in the score. To do so the 
following quantity has to be maximized: 

$$\log L(\alpha_1, \ldots, \alpha_q, \beta_1, \ldots, \beta_p) - \lambda \sum_{j=1}^{p} |\beta_j| \qquad (3)$$

where $L(\alpha_1, \ldots, \alpha_q, \beta_1, \ldots, \beta_p)$ is the likelihood of the logistic 
regression and $\lambda$ is a parameter that controls the amount of 
penalization. The optimal $\lambda$ value is searched within a 
prespecified grid to optimize the Akaike information criterion 
(AIC) by cross-validation. 29 As an example, Hirata et al used 
penalization to define a disease activity score using serum 
biomarkers. 30 
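
The way an L1 penalty sets some coefficients exactly to zero can be illustrated with a minimal sketch. To keep it short, it assumes a binary outcome rather than the ordinal model above, simulated data, and a simple proximal gradient algorithm that is our choice, not a method prescribed by the article. The penalty here applies to the mean log-likelihood, so `lam` corresponds to λ divided by the sample size, and the selection of λ over the grid is not shown.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; small values become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_l1_logistic(X, y, lam, step=0.5, n_iter=500):
    """Maximize mean log-likelihood - lam * sum(|beta_j|) by proximal gradient ascent.

    The intercept alpha is not penalized.
    """
    n, p = len(X), len(X[0])
    alpha, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Residuals y_i - P(y_i = 1) under the current parameters.
        resid = [y[i] - sigmoid(alpha + sum(b * x for b, x in zip(beta, X[i])))
                 for i in range(n)]
        alpha += step * sum(resid) / n
        for j in range(p):
            grad_j = sum(resid[i] * X[i][j] for i in range(n)) / n
            beta[j] = soft_threshold(beta[j] + step * grad_j, step * lam)
    return alpha, beta

# Simulated data (hypothetical): only the first of three features drives the outcome.
rng = random.Random(42)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(80)]
y = [1 if rng.random() < sigmoid(2.0 * row[0]) else 0 for row in X]

# Coefficients along a small, arbitrary grid of penalty values.
for lam in (0.0, 0.05, 0.2, 1e6):
    _, beta = fit_l1_logistic(X, y, lam)
    print(lam, [round(b, 3) for b in beta])
```

As the penalty grows, more coefficients are shrunk to exactly zero; the corresponding features would then be dropped from the score.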

Development of a score and associated cut-offs 

The term $S = \beta_1 x_1 + \cdots + \beta_p x_p$ in logistic regression can 
be used as a score itself, although computation of the 
probabilities of the RA activity categories allows classifying 
patients directly. Note that some variables might not appear 
in the score if they have been eliminated during the variable 
selection step. 

Development of associated cut-offs 

If for example two categories "low activity" and "high 
activity" are desired (q=2), the cut-off can be selected using 
criteria based on sensitivity and specificity. 19 However, more 
categories can be defined. For example, if four categories 
are desired, cut-offs can be defined by analyzing every 
possible triplet of score values $(c_1, c_2, c_3)$, such that the disease 
activity state is predicted as "remission" if $S \leq c_1$, "low" if 
$c_1 < S \leq c_2$, "moderate" if $c_2 < S \leq c_3$, and "high" if $S > c_3$. 
The selected triplet has to optimize a performance measure 
like the correct classification rate, as defined in the 
following paragraph. 
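
The exhaustive search over triplets can be sketched as follows. The scores and reference states are a small hypothetical dataset, the candidate cut-offs are restricted to observed score values (an assumption), and the function names are ours.

```python
from itertools import combinations

def classify(score, cutoffs):
    """Map a score to a disease activity state given increasing cut-offs (c1, c2, c3)."""
    c1, c2, c3 = cutoffs
    if score <= c1:
        return "remission"
    if score <= c2:
        return "low"
    if score <= c3:
        return "moderate"
    return "high"

def correct_classification_rate(scores, states, cutoffs):
    """Share of patients whose predicted state matches the reference state."""
    hits = sum(classify(s, cutoffs) == t for s, t in zip(scores, states))
    return hits / len(scores)

def best_cutoffs(scores, states):
    """Try every increasing triplet of observed score values and keep the best one."""
    candidates = sorted(set(scores))
    return max(combinations(candidates, 3),
               key=lambda c: correct_classification_rate(scores, states, c))

# Hypothetical scores with their gold standard states.
scores = [1.8, 2.1, 2.9, 3.3, 3.5, 4.2, 4.8, 5.0, 5.6, 6.1]
states = ["remission", "remission", "low", "low", "low",
          "moderate", "moderate", "moderate", "high", "high"]

cutoffs = best_cutoffs(scores, states)
print(cutoffs)  # (2.1, 3.5, 5.0)
print(correct_classification_rate(scores, states, cutoffs))  # 1.0
```

The perfect classification rate here only reflects the toy data, which are fully separable. Performance measured on the data used to pick the cut-offs is optimistic and needs validation.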

Predictive ability 

The performance of the classification technique may be 
evaluated with the correct classification rate, which is the 
percentage of patients whose disease activity state is correctly 
predicted by the logistic regression formula or using the 
cut-offs. The C classification index, 29 which is an equivalent 
of the area under the ROC curve, is another widely used 
discrimination index. 
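
For a two-category outcome, the C index can be computed as the proportion of (high activity, low activity) patient pairs in which the high activity patient receives the higher score, with ties counted as one half. The sketch below assumes this binary setting and hypothetical data; the function name `c_index` is ours.

```python
def c_index(scores, outcomes):
    """Concordance index for a binary outcome (1 = high activity, 0 = low activity)."""
    high = [s for s, y in zip(scores, outcomes) if y == 1]
    low = [s for s, y in zip(scores, outcomes) if y == 0]
    concordant = sum(1.0 if h > l else 0.5 if h == l else 0.0
                     for h in high for l in low)
    return concordant / (len(high) * len(low))

# Hypothetical scores: one of the four (high, low) pairs is ordered wrongly.
print(c_index([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```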

Model validation 

External validity has to be assessed by collecting a blinded 
independent dataset meeting the inclusion criteria stated under "Patients" in 
"Possible guidelines to define a disease activity score and 
cut-offs". This test set is classified according to the 
predictive model built previously. It is then unblinded to evaluate 
the performance of the model. If new data are unavailable, 
internal validation techniques such as the bootstrap may be used 
to correct original results for optimism. 29 The bootstrap consists 
in drawing about 100 to 150 random samples of the same 
size from the actual dataset, with replacement. This means 
that these samples have the same number of observations as 
the original dataset and that one observation can be selected 
several times in the same bootstrap sample. All modeling steps 
are processed again on each bootstrap sample to compute 
correct classification rates and C indexes. The classification 
techniques are then applied to the original dataset as if it 
were an independent test set. Optimism is calculated as the 
difference between the classification indexes obtained with the 
bootstrap sample and with the test sample. Optimism is finally 
averaged across all bootstrap samples and subsequently 
subtracted from the original classification indexes computed with the 
actual dataset. To sum up, if we denote by $Q$ any performance 
measure of the model built on the original data, $b$ the number 
of bootstrap samples, $Q_{boot,i}$ the performance obtained when 
regenerating the model with the i-th bootstrap sample, and $Q_{orig,i}$ 
the performance obtained when applying it to the original data, 
the optimism-corrected performance $Q_{corr}$ can be computed 
using the following formula: 

$$Q_{corr} = Q - \frac{1}{b} \sum_{i=1}^{b} (Q_{boot,i} - Q_{orig,i}) \qquad (4)$$
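
The bootstrap procedure described above can be sketched in a few lines. The "model" used here is deliberately trivial (a single accuracy-maximizing threshold on a score), the data are hypothetical, and the function names are ours. Any modeling pipeline exposing a fit step and a performance measure could be plugged in instead.

```python
import random

def accuracy(threshold, data):
    """Share of patients correctly classified as high activity (1) when score > threshold."""
    return sum((s > threshold) == bool(y) for s, y in data) / len(data)

def fit_threshold(data):
    """Toy model: the observed score value that maximizes accuracy on `data`."""
    candidates = sorted({s for s, _ in data})
    return max(candidates, key=lambda c: accuracy(c, data))

def optimism_corrected(data, fit, score, b=100, seed=0):
    """Apparent performance minus the average bootstrap optimism."""
    rng = random.Random(seed)
    apparent = score(fit(data), data)
    optimism = 0.0
    for _ in range(b):
        boot = [rng.choice(data) for _ in data]  # same size, with replacement
        model = fit(boot)                        # all modeling steps redone
        optimism += score(model, boot) - score(model, data)
    return apparent - optimism / b

# Hypothetical (score, high activity indicator) pairs.
data = [(2.1, 0), (2.8, 0), (3.0, 0), (3.4, 1), (3.6, 0),
        (4.1, 1), (4.4, 0), (4.9, 1), (5.3, 1), (6.0, 1)]

print("apparent :", accuracy(fit_threshold(data), data))
print("corrected:", round(optimism_corrected(data, fit_threshold, accuracy, b=100), 3))
```

Passing `fit` and `score` as arguments keeps the resampling logic independent of the model, so that variable selection and cut-off definition can both be repeated inside `fit` on every bootstrap sample.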

Conclusion 

In this article, some methodological issues in 
developing an RA activity score and its cut-offs are reviewed 
and addressed. Particular attention is devoted to DAS28, 
although most of the comments could be applied to SDAI 
and CDAI since they are direct modifications of DAS28. 
Despite its limitations, DAS28 is widely used in clinical trials 
and for treatment monitoring. However, developing a new 
score following the guidelines proposed in this article could 
offer an alternative tool to accurately measure RA activity 
and could thus improve patients' health care. Moreover, it 
is now very important to define a gold standard to evaluate 
RA activity, to collect reliable data, and to apply a relevant 
methodology in order to develop a valid bioclinical score 
to assess RA disease activity states. Indeed, inappropriate 
medical decisions such as treatment administration could 
be the result of an inaccurate score. Meanwhile, results of 
studies based on classic disease activity scores should be 
considered with caution. 